Abstract

We consider an adaptive finite state controlled Markov chain with partial
state information, motivated by a class of replacement problems. We
present parameter estimation techniques based on the information available
after actions that reset the state to a known value are taken. We
prove that the parameter estimates converge w.p.1 to the true (unknown)
parameter, under the feedback structure induced by a certainty equivalent
adaptive policy. We also show that the adaptive policy is self-optimizing,
in a long-run average sense, for any (measurable) sequence of parameter
estimates converging w.p.1 to the true parameter.

I. Introduction


In recent years, there has been a considerable amount of work in stochastic adaptive
control [10]-[11]. However, aside from results for linear systems, little progress has been
made on problems with incomplete or noisy state observations. An initial step in this
direction was taken in [1], which studies the adaptive estimation of the state of a finite
state Markov chain with incomplete state information, whose state transition probabilities
depend on unknown parameters. This adaptive estimation problem is that
of computing recursive estimates of the conditional probability vector of the state at time
$t$, given all the past observations, when the transition matrix $P$ is not completely known,
i.e., it depends on a vector of unknown parameters $\theta$; this dependence is expressed as
$P(\theta)$. In [1] we use the previously derived recursive filter for the conditional probabilities,
and simultaneously recursively estimate the parameters, using the most recent parameter
estimates to update the filter. This adaptive estimation algorithm is then analyzed via the
Ordinary Differential Equation (ODE) Method [12]-[13]. The convergence of the recursive
parameter estimates is established, and optimality of the adaptive state estimator is proved,
in a long-run average sense.

In [7]-[8], we began to investigate the application of similar techniques to the control
of adaptive finite state Markov chains with incomplete observations. One interesting set
of problems for which some results are available when the parameters are known are those
involving quality control, replacement, and repair of a unit in a manufacturing system or
communication network [9], [15], [18]. We formulated the adaptive version of a problem
of this type in the above references; however, the presence of feedback makes this problem
much more difficult than that of [1]. Discontinuities in the optimal control strategies lead to
averaged ODEs with discontinuous right-hand sides that cannot be handled by currently
available methods.

In this paper we present parameter estimation techniques based on the information
available after actions that reset the state to a known value are taken. At these times,
the (augmented) state process regenerates, its future evolution becoming independent of
the past. We prove (by means of the ODE method) w.p.1 convergence of the parameter
estimates to the true (unknown) parameter $\theta_0$, for a parameter estimation scheme of this
type. Then, given any sequence of parameter estimates which converges w.p.1 to $\theta_0$, and
which is measurable with respect to the filtration generated by the observations, we show
that a certainty equivalent adaptive policy is self-optimizing. The latter is obtained by
an analysis which uses the known (threshold) structure of optimal policies for problems
with known parameters. Our analysis is of particular interest since the formalism
recently presented in [17] cannot be directly applied in the present situation: here the
state is only partially observed and the optimal policy is not a continuous function of $\theta$.
The methodology used in the analysis relies largely on the w.p.1 convergence to $\theta_0$ of
the parameter estimates, and the continuity in the parameterization of quantities in the
model, like $P(\theta)$ and the solutions to the corresponding optimality equations. Hence, this
methodology is also applicable to a more general situation than the one presented here;
see [6]. In addition, we note that the feedback structure induced by our adaptive policy
obviates the need for, e.g., forced choice schemes, cf. [11].

II.  A  Partially  Observed  Binary  Replacement  Problem 

Consider a situation in which a system, such as a machine, production process, or
computer communications network, can fail. The (core) state $X_t$ of the system can either
be good (0) or failed (1); let $\mathbf{X} := \{0,1\}$. The available control actions (or decisions) are
to operate the system in its current condition (0), or to reset/replace the system to an
as-new condition (1); let $\mathbf{U} := \{0,1\}$. Assume for the moment that there is an underlying
probability space $(\Omega, \mathcal{F}, \mathcal{P})$. The process $\{X_t\}_{t \in \mathbb{N}_0}$ is modeled as a controlled finite state
Markov chain, where we have that

$$ \mathcal{P}\{X_{t+1} = j \mid X_t = i, X_{t-1}, \ldots, X_0;\ U_t = u, U_{t-1}, \ldots, U_0\} = [P(u)]_{i,j}, \qquad t \in \mathbb{N}_0 := \{0,1,2,\ldots\}, \tag{2.1} $$

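and the state transition probability matrices are given as

$$ P(0) = \begin{bmatrix} 1-\theta & \theta \\ 0 & 1 \end{bmatrix}, \qquad P(1) = \begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix}. \tag{2.2} $$
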
Here $\theta \in [0,1]$ gives the failure rate of the system. Only imperfect observations of $\{X_t\}_{t \in \mathbb{N}_0}$
are available, in the form of a random process $\{Y_t\}_{t \in \mathbb{N}}$: $Y_t$ gives a correct observation of $X_t$
with probability $q$ when $U_{t-1} = 0$, whereas if $U_{t-1} = 1$ then $Y_t = X_t = 0$. More precisely,
$Y_t \in \mathbf{Y} := \{0,1\}$ and

$$ \mathcal{P}\{Y_{t+1} = i \mid Y_t, \ldots, Y_1;\ X_{t+1} = i, X_t, \ldots, X_0;\ U_t = 0, U_{t-1}, \ldots, U_0\} = \mathcal{P}\{Y_{t+1} = i \mid X_{t+1} = i;\ U_t = 0\} =: q, \qquad t \in \mathbb{N}_0. \tag{2.3} $$

It suffices to consider only $0.5 \le q \le 1$. The cases $q = 0.5$ and $q = 1$ correspond to
the completely unobserved and completely observed situations, respectively; we restrict our
analysis to the situation of strict partial observability, i.e., $q < 1$. The one-step cost $c(x,u)$ is
defined as $c(0,0) = 0$, $c(1,0) = C$, $c(x,1) = R$, where $0 < C < R$. Probability distribution
vectors on $\mathbf{X}$ are elements of $\Delta := \{\mathbf{p} \in \mathbb{R}^2 : \mathbf{p} = [1-p,\ p],\ 0 \le p \le 1\}$. Thus, each $\mathbf{p} \in \Delta$
can be uniquely identified with a scalar $p \in [0,1]$, as indicated. Initially, there is a given
probability $0 < p_0 < 1$ that the system is failed; an action is taken, and the state evolves
according to (2.2); a first observation is received, another action is taken; and so on.

An (admissible) control law, policy, or strategy $\pi$ is a rule for selecting the actions
$U_t$, based on $h_t = (p_0, U_0, Y_1, \ldots, Y_{t-1}, U_{t-1}, Y_t)$, where $h_t$ is the available information at
time $t$. The canonical sample path space is $\Omega = \mathbf{X} \times \mathbf{U} \times (\mathbf{X} \times \mathbf{Y} \times \mathbf{U})^{\infty}$, and $\mathcal{B}$ denotes
the Borel $\sigma$-algebra obtained by endowing $\Omega$ with the discrete topology. Then to each
admissible strategy $\pi$ and $0 < p_0 < 1$, we associate the average cost


$$ J(\pi, p_0) := \limsup_{n \to \infty} E^{\pi}_{p_0}\left[ \frac{1}{n} \sum_{t=0}^{n-1} c(X_t, U_t) \right], \tag{AC} $$


where $E^{\pi}_{p_0}$ is the expectation with respect to an appropriate marginal of the (unique)
probability measure $\mathcal{P}^{\pi}_{p_0}$ on $\mathcal{B}$ induced by $p_0$ and the strategy $\pi$; see [2], [10]. The optimal (AC)
control (or decision) problem is that of selecting a strategy such that the average cost is
minimized, over all admissible strategies. The optimal (AC) cost function is defined as
$\Gamma(p_0) := \inf_{\pi} \{J(\pi, p_0) : \pi \text{ is an admissible strategy}\}$, for $0 < p_0 < 1$.

A.  Information  States 


It is well known that the conditional probability distribution process, whose $i$th
component is given by

$$ p_t^{(i)} := \mathcal{P}\{X_t = i \mid Y_t, \ldots, Y_1;\ U_{t-1}, \ldots, U_0\}, \quad t \in \mathbb{N}, \qquad \mathbf{p}_0 := [1 - p_0,\ p_0], $$

constitutes an information state (or statistic sufficient for control) [2], [4], [5], [10], [11]; for
this problem, it can be written as $\mathbf{p}_t = [1 - p_t,\ p_t]$, where $p_t$ is the conditional probability
of the process being in the failed state.

A separated strategy is a sequence of maps $\pi = (\pi_0, \pi_1, \pi_2, \ldots)$, where $\pi_t : [0,1] \to \mathbf{U}$.
When $\pi_t(\cdot) = \pi(\cdot)$ for all values of $t$, the policy is said to be stationary. Then the
partially observed, average cost problem is equivalent (i.e., equal minimum costs for each
$p_0$) to the completely observed problem, with state $p_t$ and state space $[0,1]$, of finding a
separated admissible strategy which minimizes

$$ J(\pi, p_0) := \limsup_{n \to \infty} E^{\pi}_{p_0}\left[ \frac{1}{n} \sum_{t=0}^{n-1} c(p_t, U_t) \right], $$

where $c(p,u) = (1-p)\,c(0,u) + p\,c(1,u)$. Note that $c(p,0) = pC$ and $c(p,1) = R$. Using
Bayes' rule, it is easily shown that $p_t$ can be computed recursively, as follows:


$$ p_{t+1} = T(1, p_t, U_t)\, Y_{t+1} + T(0, p_t, U_t)\,(1 - Y_{t+1}), \tag{2.4} $$

where

$$ V(1,p,0) = (1-q)(1-p)(1-\theta) + q\,[p(1-\theta) + \theta] = 1 - V(0,p,0), \tag{2.5} $$

$$ V(1,p,1) = 0, \qquad V(0,p,1) = 1, \tag{2.6} $$

$$ T(0,p,0) = \frac{(1-q)\,[p(1-\theta) + \theta]}{V(0,p,0)}, \qquad T(1,p,0) = \frac{q\,[p(1-\theta) + \theta]}{V(1,p,0)}, \tag{2.7} $$

$$ T(y,p,1) = 0; \qquad y = 0,1, \quad p \in [0,1]. \tag{2.8} $$


Here $V(y,p,u)$ is interpreted as the (one-step ahead) conditional probability of the
observation being $y$, given the decision $u$ and an a priori probability $p$ of the state being failed.
Likewise, $T(y,p,u)$ is interpreted as the a posteriori conditional probability of the unit
being failed, given that decision $u$ was made, observation $y$ was obtained, and the a priori
probability was $p$. Let $I[A]$ denote the indicator function of the event $A$. A well known property of
the process $\{p_t\}_{t=0}^{\infty}$ is the following [4].

Lemma 2.1. $\{p_t\}_{t=0}^{\infty}$ is a controlled Markov process, and its state transition probabilities
are given by

$$ \mathcal{P}^{\pi}_{p_0}\{p_{t+1} \in B \mid p_t = p,\ U_t = u\} = \sum_{y \in \mathbf{Y}} V(y,p,u)\, I[T(y,p,u) \in B] =: \mathcal{K}(B \mid p, u), \tag{2.9} $$

for all (Borel) subsets $B$ of $[0,1]$.
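As an illustration of the recursion (2.4) and the kernel (2.9), the following is a minimal sketch of the maps $V$ and $T$ in Python. The function names `obs_prob`, `posterior` and `filter_step` are ours and do not appear in the references; the formulas are those of (2.5)-(2.8), assuming $0.5 \le q < 1$ so that no denominator vanishes.

```python
def obs_prob(y, p, u, theta, q):
    """V(y, p, u): probability that the next observation equals y,
    given prior failure probability p and action u, as in (2.5)-(2.6)."""
    if u == 1:                       # reset: next state and observation are both 0
        return 1.0 if y == 0 else 0.0
    s = p * (1.0 - theta) + theta    # predicted failure probability after operating
    v1 = (1.0 - q) * (1.0 - s) + q * s
    return v1 if y == 1 else 1.0 - v1


def posterior(y, p, u, theta, q):
    """T(y, p, u): posterior failure probability, as in (2.7)-(2.8)."""
    if u == 1:
        return 0.0
    s = p * (1.0 - theta) + theta
    like = q if y == 1 else 1.0 - q  # P(Y = y | X = failed)
    return like * s / obs_prob(y, p, 0, theta, q)


def filter_step(p, u, y_next, theta, q):
    """One step of (2.4): select the branch matching the new observation."""
    return posterior(y_next, p, u, theta, q)
```

By construction, the two posteriors averaged with weights $V(0,p,0)$ and $V(1,p,0)$ return the predicted failure probability $p(1-\theta)+\theta$, which is a convenient sanity check on any implementation.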


III.  The  Structure  of  Optimal  Policies. 

Consider the optimal control problem corresponding to each parameter value $\theta \in [0,1]$.
Then, the existence of solutions to the corresponding (average cost) optimality equation
follows from the existence of a reset/repair action [6], [9], [15]-[16]. We summarize these
results as follows; dependence on $\theta$ is made explicit.

Theorem 3.1. Assume $q \in [0.5, 1)$, $\theta \in [0,1]$.

(i) There exist a constant $0 \le \Gamma_\theta \le R$ and a concave, nondecreasing map $h_\theta : [0,1] \to
[0,R]$, with $h_\theta(0) = 0$, such that

$$ \Gamma_\theta + h_\theta(p) = \min\{f_\theta(p);\ R\}, \tag{3.1} $$

where

$$ f_\theta(p) := pC + \sum_{y=0}^{1} V(y,p,0;\theta)\, h_\theta(T(y,p,0;\theta)). \tag{3.2} $$

(ii) Any stationary separated policy that achieves the minimum in (3.1) is average cost
optimal; the minimum cost is $\Gamma_\theta$, for any value of $p_0$.
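To make the optimality equation (3.1)-(3.2) concrete, the following sketch applies a relative value iteration on a uniform grid over $[0,1]$, with linear interpolation between grid points. This numerical scheme is our own illustration and is not taken from the references; the grid size, the iteration count and the normalization $h(0) = 0$ at every sweep are assumptions, and convergence of the iteration is not claimed here.

```python
def relative_value_iteration(theta, q, C, R, n_grid=101, n_iter=300):
    """Approximate (Gamma_theta, h_theta) of (3.1)-(3.2) on a uniform grid.
    Hypothetical helper: grid size and iteration count are illustrative."""
    m = n_grid - 1
    grid = [i / m for i in range(n_grid)]

    def interp(h, x):
        # Linear interpolation of the tabulated function h at x in [0, 1].
        i = min(int(x * m), m - 1)
        w = x * m - i
        return (1.0 - w) * h[i] + w * h[i + 1]

    # Precompute V(y, p, 0) and T(y, p, 0) on the grid, following (2.5) and (2.7).
    pre = []
    for p in grid:
        s = p * (1.0 - theta) + theta
        v1 = (1.0 - q) * (1.0 - s) + q * s
        v0 = 1.0 - v1
        pre.append((v0, (1.0 - q) * s / v0, v1, q * s / v1))

    h = [0.0] * n_grid
    f = [p * C for p in grid]
    gamma = 0.0
    for _ in range(n_iter):
        # f(p) = pC + sum_y V(y,p,0) h(T(y,p,0)), then take min{f, R}.
        f = [p * C + v0 * interp(h, t0) + v1 * interp(h, t1)
             for p, (v0, t0, v1, t1) in zip(grid, pre)]
        w = [min(fp, R) for fp in f]
        gamma = w[0]                  # normalization enforcing h(0) = 0
        h = [wp - gamma for wp in w]
    return gamma, h, f, grid
```

For example, `relative_value_iteration(0.1, 0.8, 1.0, 3.0)` returns an approximate cost, the tabulated $h$, and the values of $f$ on the grid; the first grid point at which $f$ reaches $R$ gives a rough estimate of where repair becomes preferable.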

The following will be used in the sequel, and its proof is given in the Appendix.

Corollary 3.1. Assume $q \in [0.5, 1)$ and $\theta \in [0,1]$.

(i) Any concave and nondecreasing solution $h_\theta(\cdot)$ of (3.1) is continuous on $[0,1]$;

(ii) furthermore, there is only one such solution satisfying $h_\theta(0) = 0$.

Henceforth, the dependence of $\mathcal{P}^{\pi}_{p_0}$ on $p_0$ will be omitted, in view of Theorem 3.1.
Equation (3.1) can then be used to determine the structure of the optimal policies [6], [9],
[15].


Theorem 3.2. Assume $q \in [0.5, 1)$ and $\theta \in (0,1]$.

(i) If

$$ \frac{C(1+\theta)}{\theta} \le R \iff \theta \ge \frac{C}{R - C}, $$

then the policy "operate ($U_t = 0$) for all $p_t \in [0,1]$" is average cost optimal.

(ii) If

$$ R < \frac{C(1+\theta)}{\theta} \iff \theta < \frac{C}{R - C}, $$

then there exists a threshold policy which is average cost optimal; i.e., there exists $\alpha(\theta) \in
(0,1)$ such that it is optimal to operate ($U_t = 0$) for $p_t \in [0, \alpha(\theta))$, and to repair ($U_t = 1$)
for $p_t \in [\alpha(\theta), 1]$.
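The dichotomy in Theorem 3.2 depends only on $\theta$, $C$ and $R$, and can be checked directly. The helper below is a small sketch with a name of our choosing (`optimal_policy_type`); assigning the boundary case $R = C(1+\theta)/\theta$ to the "always operate" branch is an assumption on our part.

```python
def optimal_policy_type(theta, C, R):
    """Classify the optimal policy structure for given theta, C, R.
    Hypothetical helper; the equality case is assigned to 'always operate'."""
    if not (0.0 < theta <= 1.0 and 0.0 < C < R):
        raise ValueError("requires 0 < theta <= 1 and 0 < C < R")
    if R < C * (1.0 + theta) / theta:
        return "threshold"
    return "always operate"
```

With $C = 1$ and $R = 3$, the critical failure rate is $C/(R-C) = 0.5$: a unit with $\theta = 0.1$ calls for a threshold policy, while one with $\theta = 0.9$ is never worth repairing.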


IV.  The  Adaptive  Binary  Replacement  Problem 

If the parameter $\theta$ is unknown, we cannot compute $p_t$, nor can we directly solve the
optimal control problem. The enforced certainty equivalence approach which we will adopt
involves simultaneously computing recursive estimates $\hat\theta_t$ of the unknown parameter, and $\hat p_t$
of the information state, and using the latest available parameter estimate in the filtering
equation (2.4) to compute the next estimate $\hat p_{t+1}$; the decision $U_t$ is made taking $\hat\theta_t$ and
$\hat p_t$ as if they were the true (correct) values. Let $\Theta_\delta := [\delta, \delta']$ be the parameter set in
which $\hat\theta_t$ is allowed to take its values, where $\delta$ is an arbitrarily small positive number and
$\delta' = \min\{1,\ \frac{C}{R-C} - \delta\}$. For decision-making, we define the set $\mathcal{OP} = \{\pi(\cdot\,;\theta)\}_{\theta \in \Theta_\delta}$ of
optimal threshold policies described above, parameterized by $\theta$. Thus, we conclude from
Theorem 3.2 (ii) that $0 < \alpha(\theta) < 1$, for each $\theta \in \Theta_\delta$, where $\alpha(\theta)$ denotes the dependence
of the threshold on $\theta$. We also let $\theta_0$ denote the (unknown) true value of the parameter,
which we assume to be constant and an element of the interior of $\Theta_\delta$. The following result,
on the continuity in $\theta$ of the optimal cost, the value function and the threshold, is proved
in [3, Theorem A.1].

Theorem 4.1. Assume $q \in [0.5, 1)$. Let $0 < \delta < 1$. Then for $\theta \in \Theta_\delta$, we have that:

(i) the pair $(\Gamma_\theta, h_\theta)$ is continuous in $\theta$;

(ii) there exists a unique $\alpha(\theta) \in (0,1)$ such that $f_\theta(\alpha(\theta)) = R$;

(iii) $\alpha(\cdot)$ is continuous on $\Theta_\delta$.

Observe that by Theorem 4.1 (iii) and since $\Theta_\delta$ is compact, there is a number $\alpha^* < 1$
such that, for all $\theta \in \Theta_\delta$, $0 < \alpha(\theta) < \alpha^*$.

A.  Adaptive  Policy. 

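Given a sequence of estimates $\{\hat\theta_t\}_{t=0}^{\infty}$ of $\theta_0$, compute the control action at each time
$t \in \mathbb{N}_0$ by

$$ U_t = \pi(\hat p_t; \hat\theta_t), \qquad \pi(\cdot\,;\cdot) \in \mathcal{OP}, \tag{4.1} $$

where the conditional probability estimate is computed recursively via

$$ \hat p_{t+1} = T(1, \hat p_t, \pi(\hat p_t; \hat\theta_t); \hat\theta_{t+1}) \cdot Y_{t+1} + T(0, \hat p_t, \pi(\hat p_t; \hat\theta_t); \hat\theta_{t+1}) \cdot (1 - Y_{t+1}), \qquad \hat p_0 = p_0. \tag{4.2} $$

We will denote by $\pi^a$ the policy given by (4.1) and (4.2).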
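A single step of the certainty equivalent scheme (4.1)-(4.2) can be sketched as follows. The threshold map `alpha` is passed in as a callable and is purely hypothetical here (computing $\alpha(\theta)$ requires solving (3.1)); the function name `ce_step` is also ours.

```python
def ce_step(p_hat, theta_hat, theta_hat_next, y_next, q, alpha):
    """One step of (4.1)-(4.2). `alpha` is a user-supplied (hypothetical)
    callable returning the threshold for a given parameter estimate."""
    # (4.1): threshold decision using the current estimates as if they were true.
    u = 1 if p_hat >= alpha(theta_hat) else 0
    if u == 1:
        return u, 0.0                 # repair resets the belief to 0, cf. (2.8)
    # (4.2): filter update with the newest parameter estimate.
    s = p_hat * (1.0 - theta_hat_next) + theta_hat_next
    like = q if y_next == 1 else 1.0 - q
    v = like * s + (1.0 - like) * (1.0 - s)   # V(y, p, 0) under the estimate
    return u, like * s / v
```

For instance, with a constant threshold `alpha = lambda th: 0.5`, a belief of 0.6 triggers a repair and the next belief is 0, whereas a belief of 0.2 leads to operation and a Bayes update driven by the next observation.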

B.  Parameter  Estimation. 

There are a number of ways to compute the estimates $\{\hat\theta_t\}$; we consider here only recursive
schemes. One method, discussed in [7]-[8], updates the parameter estimate $\hat\theta_t$ at each time
step $t$, and is similar to that used for adaptive estimation in [1]. However, the analysis of
convergence is very difficult, due to the complex feedback structure induced. We concentrate
here on algorithms which update $\hat\theta_t$ after each repair. The advantage of this approach is that
when a repair event occurs, the state of the system is reset to the "as new" state, and thus
the processes of interest are identically distributed between these events. On the other hand,
the convergence rate may be too slow, and thus some forcing may be needed to accelerate
the convergence. Algorithms that take advantage of analogous regenerative behavior in
some queueing problems, by updating after each busy period, have been presented in [13].
The next result is a direct consequence of [3, Theorem A.2].

Theorem 4.2. Under the adaptive policy $\pi^a$, regeneration occurs infinitely often (i.o.),
i.e.,

$$ \mathcal{P}^{\pi^a}\{U_t = 1,\ \text{i.o.}\} = 1. $$

Let $\tau_k$ be the $k$th repair time under $\pi^a$ (i.e., the $k$th time such that $U_t = 1$). Since $U_{\tau_k} =
1$, then $X_{\tau_k+1} = 0$, $U_{\tau_k+1} = 0$, and $Y_{\tau_k+2}$ is observed. Hence, the state is known perfectly
at $\tau_k + 1$ and the observations $\{Y_{\tau_k+2} : k = 1,2,\ldots\}$ form an independent identically
distributed (i.i.d.) sequence of Bernoulli random variables, with $\mathcal{P}^{\pi^a}\{Y_{\tau_k+2} = j\} = \lambda_j(\theta_0)$,
$j = 0,1$, where

$$ \lambda_1(\theta) := (1-\theta)(1-q) + \theta q = 1 - \lambda_0(\theta). \tag{4.3} $$

This sequence provides information about the transition from $X_{\tau_k+1} = 0$ to $X_{\tau_k+2}$, and
thus can be used to estimate $\theta_0$. Define $\bar Y_k := Y_{\tau_k+2}$. The sequence $\{\bar Y_k\}$ is i.i.d., its
distribution depending only on the true parameter $\theta_0$ and the reliability of the measuring
device ($q$).
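The regenerative sampling described above can be illustrated by simulation. The sketch below is not the adaptive policy itself: for simplicity it assumes a fixed threshold `alpha` and a filter driven by the true parameter (a separated policy that operates at belief 0), and it collects the first observation after each repair cycle begins. The function name and all numerical values are our own choices.

```python
import random


def regeneration_samples(theta0, q, alpha, n_steps, seed=0):
    """Simulate the replacement model under a fixed-threshold policy and
    return the observations Y_{tau_k + 2} (illustrative assumptions only)."""
    rng = random.Random(seed)
    x, p = 0, 0.0              # true state and belief; we start from p_0 = 0
    samples = []
    after_reset = False        # True right after a repair, when X is known to be 0
    for _ in range(n_steps):
        if p >= alpha:         # repair: state and belief are reset
            x, p = 0, 0.0
            after_reset = True
            continue
        # Operate: the unit may fail, then a noisy observation is received.
        if x == 0 and rng.random() < theta0:
            x = 1
        y = x if rng.random() < q else 1 - x
        if after_reset:        # this is the sample taken at time tau_k + 2
            samples.append(y)
            after_reset = False
        s = p * (1.0 - theta0) + theta0
        like = q if y == 1 else 1.0 - q
        p = like * s / (like * s + (1.0 - like) * (1.0 - s))
    return samples
```

With $\theta_0 = 0.3$ and $q = 0.8$, the fraction of ones among the collected samples should be close to $\lambda_1(\theta_0) = 0.38$, and inverting $\lambda_1$ at the sample mean gives a simple estimate of $\theta_0$.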

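Note that by Theorem 4.2 and the strong law of large numbers we have that

$$ \frac{1}{n} \sum_{k=1}^{n} \bar Y_k \xrightarrow[n \to \infty]{} \lambda_1(\theta_0), \qquad \mathcal{P}^{\pi^a}\text{-a.s.} \tag{4.4} $$

Let $\bar\theta_n := \hat\theta_{\tau_n+2}$. Then, setting

$$ \lambda_1(\bar\theta_n) = \frac{1}{n} \sum_{k=1}^{n} \bar Y_k, $$

where $\lambda_1(\cdot)$ is defined in (4.3), we obtain a sequence of strongly consistent parameter
estimates $\{\bar\theta_n\}$. Also, a prediction error-based algorithm can be formulated. Since the
observations take only the values $\{0,1\}$, the prediction error in this case is

$$ e_n(\theta) = \bar Y_n - \lambda_1(\theta). \tag{4.5} $$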

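However, in order to have $\bar\theta_n \in \Theta_\delta$, a projection mechanism is required. A stochastic
approximation-type recursive algorithm which is designed to minimize $E^{\pi^a}[\tfrac{1}{2} e_n(\theta)^2]$ is then

$$ \bar\theta_{n+1} = \Pi_{\Theta_\delta}\!\left( \bar\theta_n + \frac{1}{n+1}\, R_{n+1}^{-1}\, \psi_n\, e_n(\bar\theta_n) \right), \qquad \bar\theta_0 \in \Theta_\delta, \tag{4.6a} $$

where the map $\Pi_{\Theta_\delta}$ is a projection into the interior of $\Theta_\delta$. Also, $R_n$ can be computed in
different ways, e.g., if $R_n = (2q-1)^2$, then we obtain a recursive (and projected) version of
the scheme obtained from (4.4) above. We choose to use

$$ R_{n+1} = R_n + \frac{1}{n+1}\left( \psi_n^2 - R_n \right), \qquad R_1 = 1, $$

$$ \psi_n = -\frac{\partial}{\partial\theta} e_n(\theta) \Big|_{\theta = \bar\theta_n} = \frac{d}{d\theta} \lambda_1(\theta) = 2q - 1. \tag{4.6b} $$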
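A recursion of the form (4.6) can be sketched in a few lines. In the illustration below, the projection is a simple clipping onto a closed interval `[lo, hi]` (a simplification of the projection into the interior of the parameter set), the iteration index starts at $n = 1$ with $R_1 = 1$, and the input is any sequence of regeneration samples; these choices, and the names `project` and `sa_estimate`, are assumptions of ours.

```python
def project(theta, lo, hi):
    """Clip theta onto [lo, hi]; a simplified stand-in for the projection map."""
    return min(max(theta, lo), hi)


def sa_estimate(y_bar, q, theta_init, lo, hi):
    """Projected stochastic approximation in the spirit of (4.6a)-(4.6b)."""
    psi = 2.0 * q - 1.0                # gradient of lambda_1 with respect to theta
    theta, r = theta_init, 1.0         # r plays the role of R_n, with R_1 = 1
    for n, y in enumerate(y_bar, start=1):
        r = r + (psi * psi - r) / (n + 1)                    # R_{n+1}
        err = y - ((1.0 - theta) * (1.0 - q) + theta * q)    # prediction error (4.5)
        theta = project(theta + psi * err / ((n + 1) * r), lo, hi)
    return theta
```

With this choice of $R_n$ the effective step size works out to $1/(n + (2q-1)^{-2})$, so the recursion behaves like a running average of the inverted observations in which the initial guess carries a fixed, vanishing weight.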


The  following  can  then  be  shown  using  the  techniques  in  [12],  [13]. 


Theorem 4.3. Consider the algorithm (4.6). The sequence $\{\bar\theta_n\}$ converges $\mathcal{P}^{\pi^a}$-a.s.,
as $n \to \infty$, to the set of limit points of the ODE

$$ \dot\theta(t) = -R^{-1}(t)\,(2q-1)^2\,(\theta(t) - \theta_0), \qquad \dot R(t) = (2q-1)^2 - R(t). \tag{4.7} $$


Since $\theta_0$ is assumed to lie in the interior of $\Theta_\delta$, all solutions of the ODE (4.7) leave
the interior of $\Theta_\delta$ invariant and thus the projection operator $\Pi_{\Theta_\delta}$ need not be considered in
the averaged equations. It is straightforward to show that (4.7) is globally asymptotically
stable with unique limit point $\theta_0$. In the natural way, we define $\hat\theta_t$ to be constant between
updates: $\hat\theta_t := \bar\theta_n$, $t \in \{\tau_n + 2, \tau_n + 3, \ldots, \tau_{n+1} + 1\}$. We thus have the following result,
which is a direct consequence of Theorem 4.2.

Corollary 4.1. Assume $q \in (0.5, 1)$. Then the sequence $\{\hat\theta_t\}_{t=0}^{\infty}$ converges to $\theta_0$, as $t \to \infty$,
$\mathcal{P}^{\pi^a}$-a.s.
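The stability of the averaged ODE for $(\theta, R)$ used above can be verified by a short computation, sketched here under the assumptions $q \in (0.5, 1)$ and $R(0) > 0$; the symbols $\rho$ and $W$ are introduced only for this check.

$$
\begin{aligned}
R(t) &= \rho + (R(0) - \rho)\, e^{-t}, \qquad \rho := (2q-1)^2 > 0, \\
\min\{R(0), \rho\} &\le R(t) \le \max\{R(0), \rho\}, \qquad t \ge 0, \\
W(\theta) &:= \tfrac{1}{2}(\theta - \theta_0)^2, \\
\dot W &= (\theta - \theta_0)\,\dot\theta = -\frac{2\rho}{R(t)}\, W \le -\frac{2\rho}{\max\{R(0), \rho\}}\, W .
\end{aligned}
$$

Hence $W(\theta(t))$ decays exponentially, so $\theta(t) \to \theta_0$ from any initial condition, while $R(t) \to (2q-1)^2$.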

Remark 4.1: Let $\pi$ be any separated policy satisfying $\pi_t(0) = 0$, for all $t \in \mathbb{N}_0$, and
$\mathcal{P}^{\pi}\{U_t = 1,\ \text{i.o.}\} = 1$. Then the results above will also hold if $\pi$ is used instead of $\pi^a$.

V.  Average  Cost  Optimality  of  the  Adaptive  Policy 

We examine next the long-run average performance of the adaptive policy $\pi^a$ given by
(4.1) and (4.2). Let $\mathcal{F}_t$ be the $\sigma$-algebra generated by the observations up to time $t$, i.e.,
$\mathcal{F}_t = \sigma(Y_1, \ldots, Y_t)$. Note that $\{\hat\theta_t\}$ of Corollary 4.1 satisfies the following conditions:

(E1) $\hat\theta_t$ is $\mathcal{F}_t$-measurable, and $\hat\theta_t \in \Theta_\delta$, for all $t \in \mathbb{N}_0$;

(E2) $\hat\theta_t \to \theta_0$, $\mathcal{P}^{\pi^a}$-a.s.

Consider also the weaker condition:

(E2') $\hat\theta_t \to \theta_0$, in probability under $\mathcal{P}^{\pi^a}$.

Let $\{\hat\theta_t\}_{t=0}^{\infty}$ be any sequence of parameter estimates satisfying (E1) and (E2'); we will show
that the corresponding adaptive policy $\pi^a$ is self-optimizing, i.e., $J(\pi^a, p_0) = \Gamma_{\theta_0}$, for all
$0 < p_0 < 1$. In the case where $\{\hat\theta_t\}_{t=0}^{\infty}$ satisfies (E2), we will show the stronger sample path
result

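$$ \lim_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n-1} c(X_t, U_t) = \Gamma_{\theta_0}, \qquad \mathcal{P}^{\pi^a}\text{-a.s.} \tag{5.1} $$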

The method we use to verify these self-optimizing properties of $\pi^a$ is motivated by
techniques in [14] and [17]. However, the verification here does not fit in the same framework,
due to (a) discontinuity of $\pi(\cdot\,;\cdot) \in \mathcal{OP}$ in both its arguments and (b) the fact that the
cost $c(p,u)$ is an explicit function of $u$. We have that $T(y,p,u;\theta)$ is continuous in $\theta$. Using
this and the fact that regeneration occurs infinitely often, the following is shown in the
Appendix.

Lemma 5.1. If $\hat\theta_t \to \theta_0$, as $t \to \infty$, in probability under $\mathcal{P}^{\pi^a}$ ($\mathcal{P}^{\pi^a}$-a.s.), then $|\hat p_t - p_t| \to 0$,
as $t \to \infty$, in probability under $\mathcal{P}^{\pi^a}$ ($\mathcal{P}^{\pi^a}$-a.s.).

Then, we have the following.

Theorem 5.1. Assume $q \in (0.5, 1)$.

(i) If $\{\hat\theta_t\}_{t=0}^{\infty}$ satisfies (E1) and (E2'), then $\pi^a$ is self-optimizing.

(ii) If in addition $\{\hat\theta_t\}_{t=0}^{\infty}$ satisfies (E2), then $\pi^a$ is self-optimizing in a sample-path sense,
i.e., (5.1) holds.

Proof: (i) Let $\Phi_\theta(\cdot\,,\cdot)$ denote Mandl's discrepancy function, corresponding to the
parameter value $\theta \in \Theta_\delta$, i.e., for $p \in [0,1]$ and $u \in \mathbf{U}$

$$ \Phi_\theta(p,u) := c(p,u) + \sum_{y \in \mathbf{Y}} V(y,p,u;\theta)\, h_\theta(T(y,p,u;\theta)) - \Gamma_\theta - h_\theta(p). $$

Then by (2.5)-(2.8), Corollary 3.1 and Theorem 4.1, $\Phi_\theta(p,u)$ is continuous in both $p \in [0,1]$
and $\theta \in \Theta_\delta$. Furthermore, since $\Theta_\delta$ is compact, $\Phi_\theta(p,u)$ is uniformly continuous and
bounded in $(p,\theta) \in [0,1] \times \Theta_\delta$; thus, $\Phi_{\hat\theta_t}(\hat p_t, u)$ is uniformly integrable, for each $u \in \mathbf{U}$.
Therefore, for each $u \in \mathbf{U}$, we have

$$ \left| \Phi_{\hat\theta_t}(\hat p_t, u) - \Phi_{\theta_0}(p_t, u) \right| \xrightarrow[t \to \infty]{L^1} 0, $$

and since $\mathbf{U}$ is finite,

$$ E^{\pi^a}\left\{ \Phi_{\theta_0}(p_t, \pi(\hat p_t; \hat\theta_t)) \right\} \xrightarrow[t \to \infty]{} 0, \tag{5.2} $$

where we used the fact that $\Phi_{\hat\theta_t}(\hat p_t, \pi(\hat p_t; \hat\theta_t)) = 0$, since $\pi(\cdot\,;\theta) \in \mathcal{OP}$ attains the minimum
in the optimality equation (3.1), for the parameter value $\theta \in \Theta_\delta$. The result then follows from (5.2);
see [2], [10], [14], [17].

(ii) If the convergence is in the stronger $\mathcal{P}^{\pi^a}$-a.s. sense, then similarly as above, we obtain
that

$$ \Phi_{\theta_0}(p_t, \pi(\hat p_t; \hat\theta_t)) \xrightarrow[t \to \infty]{} 0, \qquad \mathcal{P}^{\pi^a}\text{-a.s.}, $$

from which the result follows. □

Appendix

For ease of notation, we will write

$$ V^y(p;\theta) := V(y,p,0;\theta), \qquad T^y(p;\theta) := T(y,p,0;\theta). \tag{A.1} $$

We quote the following useful result from [3].


Lemma A.1. Let $q \in [0.5, 1)$, $p \in [0,1]$ and $\theta \in (0,1]$. Then

(i) $T^1(p;\theta) \ge T^0(p;\theta)$, and the inequality is strict for $q \in (0.5, 1)$, $p \in [0,1)$ and
$\theta \in (0,1)$.

(ii) $T^y(p;\theta)$ is monotone nondecreasing with respect to both $p$ and $\theta$.

(iii) $|V^y(p;\theta) - V^y(p';\theta')| \le |p - p'| + |\theta - \theta'|$.

(iv) $|T^y(p;\theta) - T^y(p';\theta')| \le \left(\frac{q}{1-q}\right) \{(1-\theta)|p - p'| + (1-p')|\theta - \theta'|\}$.

(v) The iterates of $T^1(\cdot\,;\theta)$ converge uniformly and monotonically to 1.
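Properties (i), (ii) and (v) are easy to probe numerically. The following sketch evaluates the map $T^y(p;\theta)$ on a coarse grid and checks the ordering, the monotonicity in $p$ and $\theta$, and the convergence of the iterates of $T^1$ started at $p = 0$. The grid, the tolerance and the function names are our own choices, and such a check is of course no substitute for the proof in [3].

```python
def T(y, p, theta, q):
    """The map T^y(p; theta) of (A.1), i.e., T(y, p, 0; theta) from (2.7)."""
    s = p * (1.0 - theta) + theta
    like = q if y == 1 else 1.0 - q
    return like * s / (like * s + (1.0 - like) * (1.0 - s))


def check_lemma_a1(q, thetas, n_grid=21, n_iter=200, tol=1e-12):
    """Grid-based check of Lemma A.1 (i), (ii), (v); illustrative only."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    for theta in thetas:
        # (i): T^1 dominates T^0 pointwise.
        if any(T(1, p, theta, q) < T(0, p, theta, q) - tol for p in grid):
            return False
        # (ii): monotone nondecreasing in p.
        for y in (0, 1):
            vals = [T(y, p, theta, q) for p in grid]
            if any(b < a - tol for a, b in zip(vals, vals[1:])):
                return False
        # (v): iterates of T^1 from p = 0 increase to 1.
        p = 0.0
        for _ in range(n_iter):
            p_next = T(1, p, theta, q)
            if p_next < p - tol:
                return False
            p = p_next
        if 1.0 - p > 1e-9:
            return False
    # (ii): monotone nondecreasing in theta, for each fixed p and y.
    for y in (0, 1):
        for p in grid:
            vals = [T(y, p, th, q) for th in sorted(thetas)]
            if any(b < a - tol for a, b in zip(vals, vals[1:])):
                return False
    return True
```

A call such as `check_lemma_a1(0.8, [0.1, 0.3, 0.5, 0.7, 0.9])` exercises all three properties on the grid.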

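Proof of Corollary 3.1: (i) Let $\theta \in (0,1]$ be fixed. Continuity of $h_\theta(\cdot)$ on $(0,1]$ is
immediate since $h_\theta(\cdot)$ is concave and nondecreasing. To show that it is continuous at 0,
observe that $T^y(0;\theta) > 0$, for $y = 0,1$, and thus $T^y(\cdot\,;\theta)$ maps a neighborhood of 0 into
$(0,1]$. Thus, the continuity of $T^y(\cdot\,;\theta)$ and $V^y(\cdot\,;\theta)$ on $[0,1]$ (see Lemma A.1) along with
that of $h_\theta(\cdot)$ on $(0,1]$ imply the continuity of $f_\theta(\cdot)$ on $[0,1]$, which in turn implies, in view
of (3.1), that $h_\theta(\cdot)$ is continuous on $[0,1]$.

(ii) Now suppose $h_\theta^{(1)}$ and $h_\theta^{(2)}$ are any two solutions of (3.1), satisfying $h_\theta^{(1)}(0) = h_\theta^{(2)}(0) = 0$,
and let $\bar p \in [0,1]$ satisfy

$$ h_\theta^{(1)}(\bar p) - h_\theta^{(2)}(\bar p) = \sup_{p \in [0,1]} \left\{ h_\theta^{(1)}(p) - h_\theta^{(2)}(p) \right\}. \tag{A.2} $$

We distinguish two cases.

First, suppose that $h_\theta^{(2)}(\bar p) \ne R - \Gamma_\theta$. With $f_\theta^{(1)}(\cdot)$ and $f_\theta^{(2)}(\cdot)$ suitably defined, we
obtain

$$ h_\theta^{(1)}(\bar p) - h_\theta^{(2)}(\bar p) = \min\{f_\theta^{(1)}(\bar p),\ R\} - f_\theta^{(2)}(\bar p) \le f_\theta^{(1)}(\bar p) - f_\theta^{(2)}(\bar p) = \sum_{y=0}^{1} V^y(\bar p;\theta) \left\{ h_\theta^{(1)}(T^y(\bar p;\theta)) - h_\theta^{(2)}(T^y(\bar p;\theta)) \right\}. $$

Since $V^0(\bar p;\theta) + V^1(\bar p;\theta) = 1$ and $V^1(\bar p;\theta) > 0$, we conclude that $h_\theta^{(1)}(\bar p) - h_\theta^{(2)}(\bar p) =
h_\theta^{(1)}(T^1(\bar p;\theta)) - h_\theta^{(2)}(T^1(\bar p;\theta))$, and thus (A.2) still holds if we replace $\bar p$ with $T^1(\bar p;\theta)$. By
induction, for each $n \in \mathbb{N}$,

$$ h_\theta^{(1)}((T^1)^n(\bar p;\theta)) - h_\theta^{(2)}((T^1)^n(\bar p;\theta)) = \sup_{p \in [0,1]} \left\{ h_\theta^{(1)}(p) - h_\theta^{(2)}(p) \right\}. \tag{A.3a} $$

Second, suppose that $h_\theta^{(2)}(\bar p) = R - \Gamma_\theta$. Observe that $h_\theta^{(1)}(\bar p) - h_\theta^{(2)}(\bar p) \ge 0$ (since
$h_\theta^{(1)}(0) = h_\theta^{(2)}(0)$) and therefore, necessarily, $h_\theta^{(1)}(\bar p) = R - \Gamma_\theta$. Invoking the fact that
$h_\theta(\cdot)$ is nondecreasing, we conclude that

$$ h_\theta^{(1)}((T^1)^n(\bar p;\theta)) - h_\theta^{(2)}((T^1)^n(\bar p;\theta)) = 0, \qquad n \in \mathbb{N}. \tag{A.3b} $$

From Lemma A.1 (v), $(T^1)^n(\bar p;\theta)$ converges to 1 as $n \to \infty$. Taking the limit, as $n \to \infty$,
in (A.3a) and (A.3b), yields

$$ h_\theta^{(1)}(1) - h_\theta^{(2)}(1) = \sup_{p \in [0,1]} \left\{ h_\theta^{(1)}(p) - h_\theta^{(2)}(p) \right\} \ge 0. \tag{A.4} $$

Interchanging the roles of $h_\theta^{(1)}$ and $h_\theta^{(2)}$ in (A.4), we finally conclude that $h_\theta^{(1)} = h_\theta^{(2)}$. □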

We need the following definition.

Definition A.1: For $n \in \mathbb{N}$, let $D_n$ denote the set of multi-indices of length $n$, in $\{0,1\}$,
i.e., $d_n \in D_n$ is of the form $d_n = (d(1), \ldots, d(n))$, $d(i) \in \{0,1\}$. If $k \le n$ then $d_k \prec d_n$
denotes that $d_k$ agrees with the first $k$ coordinates of $d_n$, while $d_k \subset d_n$ denotes that $d_n$ is
the concatenation $d_\ell \cdot d_k \cdot d_m$, for some multi-indices $d_\ell$ and $d_m$, with $\ell + k + m = n$. Let
$\Theta_\delta^n$, $n > 0$, denote the $n$-fold product of the parameter space. For a sequence $\{\theta_t\}_{t=0}^{\infty} \subset \Theta_\delta$
and a positive integer $n \le t$, we define

$$ \theta_t^n := (\theta_{t-n+1}, \ldots, \theta_t) \in \Theta_\delta^n. $$

For the map $T^y(p;\theta)$, defined in (A.1), and a multi-index $d_n \in D_n$, $T^{d_n}(\cdot\,;\theta_t^n)$ denotes the
$n$-fold composition $T^{d(n)}(T^{d(n-1)}(\cdots;\theta_{t-1});\theta_t)$, and is the identity map if $n = 0$.

Lemma A.2. Let $\theta \in (0,1)$, $d_n \in D_n$, $0 < \delta_0 < 1$ and suppose that

$$ T^{d_\ell}(0;\theta) \le 1 - \delta_0, \qquad \forall\, d_\ell \prec d_n, \quad \ell = 1, \ldots, n. \tag{A.5} $$


Then 


(A.6) 


Proof:  With  r  :=  and  \dk\  '■=  X^,*=i  d(i),  ^  *4  dn,  we  inductively  obtain 

k  =  1, . . .  ,n .  {A.l) 

Let  pk  Td*(O;0),  and  0k  :=  r2ldtl_fc(l  -  6)k.  The  hypothesis  in  (A.5)  is  equivalent  to 


Tdl(O;0)  8  r2ld,l-‘(l  -  Sy 
1  -  Td*(O;0)  “  r2|d*|-*(i  _0)fc  1 


1=1 


Differentiating  (A. 7)  and  using  (A.8),  we  obtain, 


d_  f  pn  \  _  1  Pn  8  Efe=l(n  -  k)Pk 

d8  \1  —  pn)  8  1  Pn  '  (1  -8)0n 

<  1  Pn  i  (1  +  ij1  )Pn 
~  8  1- Pn  (1-8){1-Pn) 

_  _ Pn _ 

“  M(l-0)(1-Pn)‘ 


(A.8) 


(A.9)

By Lemma A.1 (i) and (ii), we have the following estimate

$$ 1 - T^{d_n}(0;\theta) \le 1 - T^0(0;\theta) \le \frac{q(1-\theta)}{(1-q)}, $$

which, in conjunction with (A.9), yields (A.6).

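Proof of Lemma 5.1: Observe that

$$ \hat p_t\, I[U_k = 1 : k < t] \in \left\{ T^{d_n}(0; \hat\theta_t^n) : d_n \in D_n,\ 0 \le n \le t - k - 1 \right\}, \tag{A.10} $$

and the analogous relation holds for $p_t$ if we replace $\hat\theta_t^n$ by $\theta_0$ in (A.10). By Lemma A.1
(ii), it is true in general that

$$ T^{d_n}(0; \min\{\hat\theta_t^n\}) \le T^{d_n}(0; \hat\theta_t^n) \le T^{d_n}(0; \max\{\hat\theta_t^n\}). \tag{A.11} $$

Recalling that $0 < \delta < \theta_0 < 1$, choose $\eta$ such that $[\theta_0 - \eta, \theta_0 + \eta] \subset (\delta, 1) \cap \Theta_\delta$. Since
$\alpha(\theta) < \alpha^* < 1$ on $\Theta_\delta$, using Lemma A.1 (i)-(ii), we have

$$ \hat p_t\, I[U_k = 1,\ \hat\theta_i \in [\theta_0 - \eta, \theta_0 + \eta] : i = k+2, \ldots, t] \le T^1(\alpha^*; \theta_0 + \eta) < 1. \tag{A.12} $$

In view of (A.10)-(A.12), utilizing Lemma A.2 (with $\delta_0 := 1 - T^1(\alpha^*; \theta_0 + \eta)$), and the mean
value theorem, we conclude that given $\varepsilon > 0$ there exists a neighborhood $V_\varepsilon \subset [\theta_0 - \eta, \theta_0 + \eta]$
of $\theta_0$ such that

$$ |\hat p_t - p_t|\, I[U_k = 1,\ \hat\theta_i \in V_\varepsilon : i = k+2, \ldots, t] \le \frac{q\, \mathrm{diam}(V_\varepsilon)}{\delta_0 (1-q)\, \delta} < \varepsilon, \tag{A.13} $$

where $\mathrm{diam}(V_\varepsilon)$ denotes the diameter of $V_\varepsilon$. Now, let $\varepsilon > 0$ be chosen. If $\hat\theta_t \to \theta_0$, $\mathcal{P}^{\pi^a}$-a.s.,
then, outside a set of probability 0, $\hat\theta_t \to \theta_0$ and, by Theorem 4.2, $U_k = 1$ for infinitely
many integers $k$, along every sample path. Consider an arbitrary such sample path and
choose integers $n_0 < n_1 \in \mathbb{N}$ such that $\hat\theta_t \in V_\varepsilon$ for all $t \ge n_0$ and $U_{n_1} = 1$. Then, by (A.13),
$|\hat p_t - p_t| < \varepsilon$, $\forall t > n_1$, along the sample path.

On the other hand, if $\hat\theta_t \to \theta_0$ in probability under $\mathcal{P}^{\pi^a}$, then defining

$$ A_t^n := \{U_k = 0 : t \le k \le t + n\}, \qquad t, n \in \mathbb{N}, $$

and applying [3, Theorem A.2], choose $n_0 \in \mathbb{N}$ such that $\mathcal{P}^{\pi^a}\{A_{t-n_0}^{n_0}\} < \frac{\varepsilon}{2}$, for all $t \ge n_0$.
There exists an integer $n_1 \in \mathbb{N}$, $n_1 \ge n_0$, such that $\mathcal{P}^{\pi^a}\{\hat\theta_t^{n_0} \notin V_\varepsilon^{n_0}\} < \frac{\varepsilon}{2}$ for all $t \ge n_1$.
Therefore, by (A.13),

$$ \mathcal{P}^{\pi^a}\{|\hat p_t - p_t| \ge \varepsilon\} \le \mathcal{P}^{\pi^a}\{\hat\theta_t^{n_0} \notin V_\varepsilon^{n_0}\} + \mathcal{P}^{\pi^a}\{A_{t-n_0}^{n_0}\} < \varepsilon, \qquad \forall t \ge n_1, $$

and the proof is complete.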